Explainable Artificial Intelligence (XAI) methods such as LIME and SHAP are widely used to understand the predictions of machine learning models. Most research focuses on generating explanations rather than on checking whether the explanations produced by different methods are consistent with one another.
This paper presents a framework called Consistency-Aware Explainable AI (CA-XAI) that checks how well different explanation methods agree. It uses a score called the Explanation Consistency Score (ECS), which combines two metrics: Jaccard Similarity and Spearman Rank Correlation.
We tested this framework on the UCI Heart Disease dataset using Logistic Regression, Random Forest, and XGBoost classifiers. We evaluated explanation consistency across 50 test instances for each model.
The results show that strong predictive performance does not imply consistent explanations. Logistic Regression achieved the highest Explanation Consistency Score, while Random Forest achieved the lowest even though it was the best at classification.
The proposed ECS framework provides a reliable and computationally lightweight way to evaluate explanation consistency across machine learning models.
Introduction
The paper addresses a key limitation in Explainable Artificial Intelligence (XAI): while methods like LIME and SHAP are widely used to explain machine learning predictions, there is no standard way to measure whether different explanation methods agree with each other. This is especially important in sensitive domains like healthcare and finance, where inconsistent explanations can reduce trust and lead to poor decisions.
To solve this, the authors propose a new metric called the Explanation Consistency Score (ECS), which combines Jaccard Similarity (feature overlap) and Spearman Rank Correlation (feature ranking agreement) to measure how consistent LIME and SHAP explanations are. They also introduce a general evaluation framework called CA-XAI that can be applied to any pair of feature-importance-based explanation methods.
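As a rough illustration of how such a score could be computed for a single test instance, the sketch below combines the Jaccard similarity of the top-k features with the Spearman rank correlation of absolute importances. The source does not specify these details, so the choice of k, the use of absolute values, the weighting parameter `alpha`, and all function names are assumptions made for this example and not necessarily the authors' implementation.

```python
from __future__ import annotations

import math
from typing import Sequence


def top_k_features(importances: Sequence[float], k: int) -> set[int]:
    """Indices of the k features with the largest absolute importance."""
    order = sorted(range(len(importances)),
                   key=lambda i: abs(importances[i]), reverse=True)
    return set(order[:k])


def jaccard(a: set[int], b: set[int]) -> float:
    """|A intersect B| / |A union B|; taken as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant ranking counts as no agreement
    return cov / math.sqrt(var_x * var_y)


def ecs(lime_weights: Sequence[float], shap_values: Sequence[float],
        k: int = 5, alpha: float = 0.5) -> float:
    """Weighted mix of top-k Jaccard overlap and Spearman rank agreement."""
    if len(lime_weights) != len(shap_values):
        raise ValueError("both explanations must cover the same features")
    overlap = jaccard(top_k_features(lime_weights, k),
                      top_k_features(shap_values, k))
    rank_agreement = spearman([abs(v) for v in lime_weights],
                              [abs(v) for v in shap_values])
    return alpha * overlap + (1 - alpha) * rank_agreement


# Made-up attribution vectors, for illustration only.
lime_example = [0.40, -0.30, 0.20, 0.10, 0.05, 0.00]
shap_example = [0.35, -0.10, 0.25, 0.02, 0.15, 0.01]
print(ecs(lime_example, shap_example, k=3))
```

Under these assumptions the score ranges from -0.5 (disjoint top-k sets and fully reversed rankings) to 1.0 (identical explanations). Whether the original ECS rescales the Spearman term to avoid negative values is not stated in the source.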
Experiments are conducted on the UCI Heart Disease dataset using three machine learning models: Logistic Regression, Random Forest, and XGBoost. For each model, LIME and SHAP explanations are generated and compared using ECS.
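Once a consistency score is available for each test instance, the per-model comparison can be reduced to a summary such as the mean and spread of those scores. The sketch below shows one way to do that. The function name `summarise_ecs` and the placeholder scores are invented for illustration and are not results from the paper.

```python
from __future__ import annotations

from statistics import mean, stdev


def summarise_ecs(scores_by_model: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Return (model, mean score, sample std) rows, most consistent model first."""
    rows = []
    for model, scores in scores_by_model.items():
        # A sample standard deviation needs at least two scores.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        rows.append((model, mean(scores), spread))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder per-instance scores (not taken from the paper).
demo_scores = {
    "model_a": [0.6, 0.5, 0.7],
    "model_b": [0.3, 0.4, 0.2],
}

for model, avg, spread in summarise_ecs(demo_scores):
    print(f"{model}: mean ECS = {avg:.4f}, std = {spread:.4f}")
```

A low standard deviation indicates that the two explanation methods agree to a similar degree across instances, which is one possible reading of a model being "stable" in this sense.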
The results show that explanation consistency varies depending on model complexity. Simpler models tend to produce more consistent explanations, while more complex ensemble models reduce agreement between LIME and SHAP. This highlights a trade-off between predictive performance and interpretability consistency.
Conclusion
This paper presents the Explanation Consistency Score (ECS), a way to measure how well two explanation methods, LIME and SHAP, agree with each other. The Explanation Consistency Score combines two components: Jaccard Similarity and Spearman Rank Correlation.
The researchers tried out the Explanation Consistency Score with three models (Logistic Regression, Random Forest, and XGBoost) on a dataset about heart disease. They found that Logistic Regression had the highest Explanation Consistency Score, 0.5576, and it was also very stable. On the other hand, Random Forest was very good at prediction, but its Explanation Consistency Score was the lowest, at 0.3879.
This suggests that models that are too complex may not do a good job of explaining their predictions in a consistent way. So when choosing a model, we should consider not only how well it predicts but also how consistently it can be explained.
A strength of the Explanation Consistency Score is that it is inexpensive to compute and works with any model. It is also easy to add to existing model evaluation workflows.
There are a few things to keep in mind, though. The researchers only tried this out on one dataset, so we do not know if it will work the same way on other datasets. They also only used LIME and SHAP, so we do not know what would happen with other explanation methods. They gave equal weight to Jaccard Similarity and Spearman Rank Correlation, which might not always be the best choice.
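To make the weighting question concrete, the score can be written with an explicit weight. The notation below is an assumption introduced for illustration: the symbols α, A, B, d_i and n do not appear in the source, and A and B are taken to be the sets of top-ranked features from the two explanation methods.

```latex
\mathrm{ECS}_{\alpha} = \alpha \, J(A, B) + (1 - \alpha) \, \rho,
\qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n \, (n^{2} - 1)}
```

Here d_i is the difference between the ranks that the two methods assign to feature i, n is the number of features, and the closed form for ρ holds only when there are no tied ranks. Equal weighting corresponds to α = 1/2. Values of α closer to 1 emphasise overlap in the selected features, while values closer to 0 emphasise agreement in their ordering.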
The researchers took steps to make their results reproducible by using the same random seeds every time and making their code public.
In the future, the researchers want to try the Explanation Consistency Score with more complex models, such as deep neural networks, and see whether it can be improved by changing how Jaccard Similarity and Spearman Rank Correlation are weighted.